1. Introduction

Creating a ranking of the 100 greatest musical artists of all time is an ambitious and highly subjective task. Nevertheless, that is what Rolling Stone magazine attempted in 2010 with the publication of the “100 Greatest Musical Artists of All Time”.

More than a decade later, this research explores how the music of these revered artists has endured as of the close of 2023. The analysis is divided into three steps:

  1. Analyzing Popularity: Evaluate the ranked artists’ popularity on Spotify at the end of 2023 and investigate correlations between ranking position, current artist popularity, and Spotify track popularity.
  2. Time Impact of Releases: Examine the impact of older and newer releases on track and artist popularity, taking into account the artist’s status (active and performing, disbanded, or deceased).
  3. Audio Features Analysis: Compare the audio features of the top tracks of today’s most popular artists with those of the artists in the Rolling Stone ranking.

2. Data

Various data sources were used to explore the popularity of the 100 artists ranked by Rolling Stone. This involved web scraping, interaction with the Spotify API, and merging datasets to create a comprehensive analysis dataset.

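2.1 Data Sources

  1. Sources and data retrieved via web scraping techniques
  • Rolling Stone website: ranking positions and artist names.
  • MusicBrainz.org: begin and end dates of each artist or group.
  • ChartMasters: ranking of the most popular artists on Spotify.
  2. Other sources and data retrieved
  • Spotify API: Utilized the spotifyr package for artist searches, top track retrieval, and audio feature extraction.

2.2 Data Processing Steps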

The data processing steps involved, in general, retrieving the data via web scraping or via the Spotify API, and treating the datasets so that the artist ID, the track ID, or a combination of both could serve as the key for merging the information. After processing, the data was saved as an RDS file that could be loaded later, avoiding unnecessary calls to the API or to the web server.

Some of the key decisions made during this stage were:

  • The 58th position in the Rolling Stone ranking is Parliament and Funkadelic. Despite occupying one position in the ranking, these were two separate bands that only recently became one. Therefore, they are treated as separate bands (Parliament and Funkadelic) to portray their most famous songs, so the final ranking presents 101 artists.
  • The choice of RDS files over a relational database stems from the fact that the Spotify API returns data frames that contain lists. Creating a SQL database would require flattening these variables, which would involve more complex work that might not provide better results.
  • The analysis of the songs assumes that the main artist of a track is the artist being analysed; other artists featured on the same track are disregarded.
  • The use of the US market for the top tracks relies on the fact that Rolling Stone is an American magazine. The artists’ current popularity is therefore more meaningful for that country, since in other countries some artists might never have been famous at all.
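The caching approach described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: `load_or_fetch`, `fake_fetch` and the cache path are hypothetical names, and `fake_fetch` stands in for a real API or scraping call. The example also shows that a data frame with a list column survives the RDS round trip without flattening.

```r
# Hypothetical helper: read a cached RDS file if it exists,
# otherwise run the fetch function and cache its result.
load_or_fetch <- function(path, fetch_fun) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch_fun()
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(result, path)
  result
}

# Stand-in for an API call (illustrative values only).
# The counter records how many times the "API" was actually called.
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  df <- data.frame(id = c("a1", "b2"), popularity = c(80L, 45L))
  df$genres <- list(c("rock", "pop"), character(0))  # list column
  df
}

cache_path <- file.path(tempdir(), "artist_cache_example.rds")
if (file.exists(cache_path)) file.remove(cache_path)

first  <- load_or_fetch(cache_path, fake_fetch)  # calls fake_fetch, writes the file
second <- load_or_fetch(cache_path, fake_fetch)  # reads the file, no new call
```

After both calls, `calls` is still 1, which is the behaviour that avoids repeated requests to the API or the web server.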

This integrated dataset offers a succinct yet comprehensive foundation for exploring patterns and trends in the popularity and characteristics of Spotify’s most popular artists and their top tracks in the United States.

3. Analysis

3.1 Popularity Analysis

We start this analysis by examining the artist popularity index provided by the Spotify API, which ranges from 0 to 100, with 100 being the most popular, and is based on the popularity of the artist’s tracks. Each track’s popularity also varies between 0 and 100; its value is based, for the most part, on the total number of plays the track has had and how recent those plays are. Although we cannot assess the specific weight given to the recency of the plays of a specific track, the documentation explicitly states that “songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past”.

Hence, it is fair to consider that Spotify’s popularity index is aligned with the main goal of this study: to capture the artist’s relevance at the end of 2023. It is also important to highlight that, since there is no exact date threshold, the fact that the popularity index is not updated in real time does not limit its usage.

Assuming a normal distribution of popularity among Spotify artists, this study also classifies those with a popularity index over 50 as “Popular”, while those below are “Unpopular”. With this assumption, the interactive plot below portrays each artist’s position in the Rolling Stone ranking on the x axis against their current popularity on Spotify on the y axis. The color of each dot shows whether the artist is considered Popular, and the red line shows the trend between the two variables.
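The classification rule can be sketched in R as follows. The data frame and its values are illustrative and not taken from the study’s dataset, and treating an index of exactly 50 as “Unpopular” is an assumption, since the rule only speaks of values over 50.

```r
# Illustrative values only, not real Spotify data
artists <- data.frame(
  ranking    = c(1, 25, 60, 99),
  popularity = c(82, 50, 67, 38)
)

# Popular if the index is strictly above 50
# (assumption: exactly 50 counts as Unpopular)
artists$status <- ifelse(artists$popularity > 50, "Popular", "Unpopular")

# Number of artists in each category
status_counts <- table(artists$status)
```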

It is clear that the vast majority of artists in the ranking can be considered popular at the end of 2023. In other words, since an artist’s popularity takes into account the popularity of their songs, the plot already demonstrates that most of the songs by the artists in the Rolling Stone ranking have indeed endured over time and are still popular in 2023.

The red regression line also displays a negative relationship between ranking position and popularity, which is expected: a lower ranking number means a better position, so better-placed artists should be more popular. Nevertheless, the slight tilt of the regression line indicates that the correlation between the Rolling Stone ranking and Spotify popularity is weak, implying that a better ranking position does not necessarily translate into a higher popularity in 2023.
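To make the link between the trend line and the correlation concrete, the sketch below fits a simple linear regression on made-up numbers (not the study’s data) and compares the slope with the Pearson correlation. For a simple regression the slope equals the correlation multiplied by the ratio of the standard deviations, so a nearly flat line corresponds to a weak correlation.

```r
# Made-up example data: ranking position and Spotify popularity
example <- data.frame(
  ranking    = 1:10,
  popularity = c(80, 62, 75, 58, 70, 66, 73, 55, 68, 61)
)

# Linear trend of popularity on ranking position
fit   <- lm(popularity ~ ranking, data = example)
slope <- unname(coef(fit)["ranking"])

# Pearson correlation between the two variables
r <- cor(example$ranking, example$popularity)

# For simple linear regression: slope = r * sd(y) / sd(x)
slope_from_r <- r * sd(example$popularity) / sd(example$ranking)
```

Both `slope` and `r` are negative here, reflecting that a larger ranking number (a worse position) goes with a lower popularity in this toy data.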

From the plot above and the statistical summary of these artists’ popularity below, it is also possible to notice that the distribution is not heavily skewed, given that the median is close to the mean. The data is fairly widely spread, without any major outliers that can be spotted by visual inspection of the plot.

In more detail, the Spotify API also provides information on an artist’s top 10 tracks in a specific market. Given that Rolling Stone is a North American magazine, it is expected that the artists featured in its ranking were, at a certain point in time, quite well known in the United States. Therefore, a decline in popularity would carry genuine significance. In contrast, in other markets, those same artists might never have achieved fame, rendering their current popularity less meaningful in those regions.

Therefore, using the information from the top tracks in the United States for the artists in the Rolling Stone ranking, the plot below portrays the popularity of each artist’s tracks, along with the average track popularity and the artist’s overall popularity.

From the plot above, it can be seen that if an artist’s top tracks have a higher average popularity, then the artist usually has a higher popularity as well, although the average popularity of the top tracks is never greater than the overall artist popularity. Another interesting aspect is the presence of outliers in track popularity, especially among the less popular artists.

Using the interquartile range (IQR) to identify possible outliers, the table below shows that the average popularity of the artists that have a positive outlier is below the average popularity of all the artists in the Rolling Stone ranking. This suggests that the most popular artists of today are not the ones with one-time hits, but the ones who have multiple hits.

Positive Outliers Summary

  Total Positive Outliers                        34
  Unique Artists With Positive Outliers          29
  Mean Artist Popularity                         61.11765
  Median Artist Popularity                       62.5
  Mean Difference to Average Track Popularity    14.69118
  Max Difference to Average Track Popularity     42.1

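The IQR-based identification of positive outliers can be sketched as below. This assumes the common rule of flagging values above Q3 + 1.5 × IQR, applied separately to each artist’s top tracks; the multiplier, the per-artist grouping and the helper name `is_positive_outlier` are assumptions for illustration, and the data are made up.

```r
# Made-up top-track popularity for two hypothetical artists
tracks <- data.frame(
  artist = rep(c("Artist A", "Artist B"), each = 6),
  track_popularity = c(60, 62, 61, 63, 59, 64,   # no stand-out track
                       30, 32, 31, 29, 33, 75)   # one track far above the rest
)

# TRUE for values above the upper fence Q3 + k * IQR
is_positive_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  x > q[2] + k * (q[2] - q[1])
}

# Apply the rule within each artist and put the flags back in row order
tracks$positive_outlier <- unsplit(
  lapply(split(tracks$track_popularity, tracks$artist), is_positive_outlier),
  tracks$artist
)

# Artists that have at least one positive outlier
artists_with_outliers <- unique(tracks$artist[tracks$positive_outlier])
```

In this toy data only the track with popularity 75 is flagged, which mirrors the pattern of a single hit standing far above the rest of an artist’s top tracks.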
3.2 Time and Artist’s Status Impact on Popularity

Two other factors that can potentially impact an artist’s current popularity are the age of the song and the status of the artist. Regarding time, one would expect older songs to fade, implying that the current relevance of a song could rely heavily on the recency of its release. Concerning the artist’s status, if the band no longer exists or the artist has passed away, they are no longer producing new music, which could decrease the artist’s popularity over time.

To examine these two aspects, the top tracks’ album release dates from the Spotify API were combined with the artists’ lifespan information from MusicBrainz.org. This dataset allowed the study to determine, at the time of each top track’s release, whether the artist had already passed away or the band or group had been dissolved.

For the plots below, the tracks that were released while the artist or group was still active are identified as a “Regular Release”, while the others are “Releases of Non Active Artist”. The plots are divided into popular and unpopular artists. The x axis shows the release date of each of the artist’s top tracks, and the y axis shows the popularity of the track.
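The labelling of releases can be sketched as follows. The date handling follows the convention used in the appendix code (a year-only date becomes 1 July of that year, a year-month date becomes the first of that month); applying the same convention to release dates, the helper name `normalise_date`, the column names and the sample rows are assumptions for illustration.

```r
# Bring partial dates ("YYYY" or "YYYY-MM") to a full "YYYY-MM-DD" Date
normalise_date <- function(x) {
  x <- ifelse(is.na(x), "", x)
  x <- ifelse(nchar(x) == 4, paste0(x, "-07-01"), x)  # year only
  x <- ifelse(nchar(x) == 7, paste0(x, "-01"), x)     # year and month
  as.Date(ifelse(x == "", NA_character_, x))
}

# Made-up tracks; artist_end is NA when the artist is still active
releases <- data.frame(
  track        = c("Track 1", "Track 2", "Track 3"),
  release_date = c("1969", "1995-11", "2009-09-09"),
  artist_end   = c("1970", "1970", NA)
)

release_day <- normalise_date(releases$release_date)
end_day     <- normalise_date(releases$artist_end)

# A release counts as non-active only when an end date exists
# and the track came out after it
releases$release_type <- ifelse(
  !is.na(end_day) & release_day > end_day,
  "Releases of Non Active Artist",
  "Regular Release"
)
```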

The first noticeable aspect of the plots is the concentration of songs released prior to 1995, which is expected given that most of the artists in the Rolling Stone ranking are historical figures.

For the first plot, the regression lines are close to parallel to the x axis, which indicates little to no correlation; hence, no major conclusions can be drawn.

The second plot, on the other hand, displays steeper regression lines, indicating a connection between release date and track popularity. The opposing directions of the lines suggest that, for non-active artists, more recent releases are more popular, while the opposite holds for active artists. Hence, one possible finding is that the end of a band or group could help improve the popularity of the artist over time. Perhaps this is because regular releases can be new tracks, while releases of non-active artists are relaunches of past successes that could have a “guaranteed” better performance. Of course, no causal claims can be made, but this conclusion is also aligned with the previous finding that the popular artists are the ones with multiple hits, and not the ones with just one big hit.

3.3 Audio Feature Analysis

To conclude this study, the audio features of the tracks are analysed. Are the hits of today different from the songs of the ranked artists? To answer this, we first complement the dataset with ChartMasters’ ranking of the 50 most popular artists on Spotify (as of 29/12/2023), then use the Spotify API to gather audio feature information for the top tracks of three groups: the most popular artists on Spotify (per ChartMasters), the popular artists in the Rolling Stone ranking, and the unpopular artists in the Rolling Stone ranking. The plots below show the density distribution of each audio feature, with colors representing the artists’ popularity category.
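A per-category density comparison of this kind can be sketched with base R as shown below. The values are randomly generated stand-ins for a single feature (danceability, which Spotify reports on a 0 to 1 scale); the category labels and sample sizes are assumptions and do not reproduce the study’s data.

```r
set.seed(1)

# Synthetic danceability values for three hypothetical categories
features <- data.frame(
  category = rep(c("Most popular today",
                   "Popular (ranking)",
                   "Unpopular (ranking)"), each = 50),
  danceability = c(rbeta(50, 6, 3), rbeta(50, 5, 4), rbeta(50, 4, 5))
)

# One kernel density estimate per category, evaluated on the 0-1 range
densities <- lapply(
  split(features$danceability, features$category),
  density, from = 0, to = 1
)

# Location of the density peak for each category
peaks <- vapply(densities, function(d) d$x[which.max(d$y)], numeric(1))
```

Comparing where each category’s density peaks, and how much the curves overlap, is the kind of reading applied to the plots in this section.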

Due to the word limit of this study, this analysis focuses on the features that present significant differences among the three categories of artists. For more information on each audio feature, refer to the Spotify API documentation.

Regarding the plots, some differences can be highlighted, especially in the danceability, energy, loudness, acousticness, liveness and valence features. In summary, when compared to the Rolling Stone artists, the most successful songs of today present a higher danceability and an energy level concentrated around 0.75; they are quieter, less acoustic, more often released as studio versions (lower liveness) and present a lower valence. It is also noteworthy that, for the energy, loudness, acousticness and liveness features, the popular artists in the Top 100 ranking are more closely aligned with the distribution of the most popular artists of today. This suggests that the audio features of a song may indeed have an impact on the endurance of the tracks and of the artists.

The difference in the valence feature is also interesting to highlight. According to Spotify, this feature describes the overall positiveness of a song, with low-valence tracks sounding more negative (e.g. sad, depressed, angry). This study has no evidence on the cause of this difference, but further analysis could investigate how it relates to broader societal issues over time.

3.4 Conclusion

In conclusion, this study aimed to analyse the endurance of the 100 greatest artists of all time according to Rolling Stone magazine. The research found that the majority of the artists in the ranking, and their songs, have indeed succeeded in their battle against time and are still popular at the end of 2023. The analysis suggests that the artists and songs that have endured best are the ones with multiple hits, or with re-released old successes. The more popular artists are also the ones whose songs have a higher danceability, the right amount of energy, lower intensity (quieter), are not acoustic or live versions, and have a lower valence. This study focused on correlation and cannot make any claims about causality; therefore, further studies could contribute to these conclusions by analysing possible causality and verifying whether the conclusions remain consistent when other countries are taken into account.

Appendix: All code in this assignment

knitr::opts_chunk$set(echo = TRUE)

#library(ggplot2)
#library(ggpubr) #for stat_cor
#library(corrplot) #for correlation table
#
#library(viridis) #scale color viridis
#library(hrbrthemes)



#################### Final Packages
library(rvest)
library(dplyr)
library(RSelenium)

library(spotifyr)
library(jsonlite)

library(ggplot2)
library(plotly) #making the ggplot interactive
library(knitr) #For printing table of Outlier Summary

library(tools)
library(patchwork)
# ############### Getting Rolling Stones Ranking Info
# 
# #Retrieving the information from the Rolling Stones Ranking
# rD <- rsDriver(browser=c("firefox"), verbose = F, chromever = NULL)
# 
# driver <- rD[["client"]] 
# url <- "https://www.rollingstone.com/music/music-lists/100-greatest-artists-147446/"
# 
# #Navigating to the URL
# driver$navigate(url)
# 
# #Allowing the page to load
# Sys.sleep(2)
# 
# #Rejecting cookies
# cookie <- driver$findElement(using = 'css', value = '#onetrust-reject-all-handler')
# cookie$clickElement()
# 
# #Getting to the bottom of the page
# # loading all the dynamic html
# webElem <- driver$findElement("css", "body")
# webElem$sendKeysToElement(list(key = "end"))
# 
# #Getting the full page html
# page_source <- driver$getPageSource()[[1]]
# 
# #Extracting the rankings
# html_elements_ranking <- read_html(page_source) %>%
#   html_nodes(".c-gallery-vertical-album__number") %>% html_text()
# 
# #Extracting the Artists for each ranking
# html_elements_titles <- read_html(page_source) %>%
#   html_nodes(".c-gallery-vertical-album__title") %>% html_text()
# 
# #Finding the load_more button
# load_more <- driver$findElement(using = 'css', value = ".c-gallery-vertical__load-button")
# load_more$clickElement()
# 
# #Going to the bottom of the page to load every artist
# webElem <- driver$findElement("css", "body")
# webElem$sendKeysToElement(list(key = "end"))
# 
# #Getting the full page html
# page_source <- driver$getPageSource()[[1]]
# 
# #Extracting the rankings
# html_elements_ranking <- c(html_elements_ranking, read_html(page_source) %>%
#   html_nodes(".c-gallery-vertical-album__number") %>% html_text())
# 
# #Extracting the Artists for each ranking
# html_elements_titles <- c(html_elements_titles, read_html(page_source) %>%
#   html_nodes(".c-gallery-vertical-album__title") %>% html_text())
# 
# #Closing the port 
# driver$close()
# rD$server$stop()
# 
# # close the associated Java processes if necessary:
# system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
# 
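# #Transforming the data into a Dataframe
# # (ranking is converted to integer so that later sorting is numeric rather than
# #  alphabetical; this assumes the scraped ranking text is a plain number)
# rankings_df <- data.frame(ranking = as.integer(trimws(html_elements_ranking)), artist = html_elements_titles)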
# 
# # Save the data frame as a RDS file
# #saveRDS(rankings_df, "./data/rs_ranking.rds")
# ############### Getting Spotify Info for Every Artist in the RS Ranking
# 
# #Reading the client ID and Secret to access the API
# readRenviron("api.env")
# client_id <- Sys.getenv("SPOTIFY_CLIENT_ID")
# client_secret <- Sys.getenv("SPOTIFY_CLIENT_SECRET")
# 
# #Generating the access_token
# access_token <- get_spotify_access_token(client_id, client_secret)
# 
# #Retrieving the information from the Rolling Stones Ranking
# rs_ranking <- readRDS("./data/rs_ranking.rds")
# 
# #Creating a DF to store the data
# artist_id_pop <- data.frame()
# 
# # Loop through each artist in the Rolling Stones Ranking
# for (artist in rs_ranking$artist){
#   q <- artist
#   print(q)
#   # Search for the artist on Spotify
#   # and getting only the first result
#   result <- search_spotify(
#     q,
#     type = "artist",
#     market = NULL,
#     limit = 1,
#     offset = 0,
#     include_external = NULL,
#     authorization = access_token,
#     include_meta_info = FALSE
#   )
#   
#   new_row <- data.frame(result)
#   
#   # Add artist name and Spotify data to the dataframe
#   new_row <- cbind(artist_rs = q, new_row)
#   artist_id_pop <- rbind(artist_id_pop, new_row)
# }
# 
# #Save the initial result as a RDS file to avoid having to make new calls to the API
# #saveRDS(artist_id_pop, file = "./data/artist_id_pop_spotify.rds")
# 
# #Loading the data that was saved
# artist_id_pop <- readRDS("./data/artist_id_pop_spotify.rds")
# 
# # Checking for differences on the artist of the Ranking
# # and the result retrieved from the API
# for (i in 1:nrow(artist_id_pop)){
#   if (artist_id_pop$name[i] != artist_id_pop$artist_rs[i]){
#   print(paste("different name on index", i))
#   print(paste("1 -",artist_id_pop$name[i]))
#   print(paste("2 -",artist_id_pop$artist_rs[i]))
#   }
# }
# 
# #Differences are minor or non relevant (eg. &, and; upper and lower case differences)
# #Only notable differences are:
# # - Santana and Carlos Santana -> Sticking to the id provided by spotify that is the verified artist
# # - Hank Williams brought the Hank Williams Jr singer. Must be replaced by Hank Williams with the artist ID of '1FClsNYBUoNFtGgzeG74dW'
# # - Differences on line 43 "1 - George Clinton & Parliament Funkadelic" - "2 - Parliament and Funkadelic"
# 
# #Retrieving the correct information on the wrong artists
# 
# #Getting the ids of the artists that are wrong
# #Parliament and Funkadelic
# par_fun_wrong_id <- artist_id_pop %>% filter(artist_rs == "Parliament and Funkadelic") %>% pull(id)
# #Hank Williams
# hank_wil_wrong_id <- artist_id_pop %>% filter(artist_rs == "Hank Williams") %>% pull(id)
# 
# #Doing the search again to check for better results
# result <- search_spotify(
#   "Parliament Funkadelic",
#   type = "artist",
#   market = NULL,
#   limit = 20,
#   offset = 0,
#   include_external = NULL,
#   authorization = access_token,
#   include_meta_info = FALSE
# )
# 
# #Selecting the lines of the band Parliament and the band Funkadelic
# result_par <- subset(result, id == "5SMVzTJyKFJ7TUb46DglcH")
# new_row_p <- cbind(artist_rs = "Parliament and Funkadelic", result_par)
# result_fun <- subset(result, id == "450o9jw6AtiQlQkHCdH6Ru")
# new_row_f <- cbind(artist_rs = "Parliament and Funkadelic", result_fun)
# #Removing the line with the wrong result
# artist_id_pop <- subset(artist_id_pop, id != par_fun_wrong_id)
# #Adding the new lines
# new_row <- rbind(new_row_p, new_row_f)
# artist_id_pop <- rbind(artist_id_pop, new_row)
# 
# #Getting Hank Williams info
# result <- search_spotify(
#   "Hank Williams",
#   type = "artist",
#   market = NULL,
#   limit = 20,
#   offset = 0,
#   include_external = NULL,
#   authorization = access_token,
#   include_meta_info = FALSE
# )
# 
# #Getting the row with the correct result
# result_hank_wil <- subset(result, id == "1FClsNYBUoNFtGgzeG74dW")
# new_row <- cbind(artist_rs = "Hank Williams", result_hank_wil)
# #Removing the row that was wrong
# artist_id_pop <- subset(artist_id_pop, id != hank_wil_wrong_id)
# artist_id_pop <- rbind(artist_id_pop, new_row)
# 
# #Saving the dataset as a RDS file
# #saveRDS(artist_id_pop, file = "./data/artist_id_pop.rds")
# 
# #Retrieving the data 
# artist_id_pop <- readRDS("./data/artist_id_pop.rds")
# 
# # Merging the two dataframes, to get one dataset with
# # with the Rolling Stones Ranking and the Spotify data
# artist_info_ranking <- merge(rs_ranking, artist_id_pop,
#                              by.x = "artist", by.y = "artist_rs",
#                              all.x = TRUE)
# 
# #Changing the names of columns
# artist_info_ranking <- artist_info_ranking %>%  rename(artist_rs = artist,
#                                                        artist_spotify = name)
# 
# # Reorder columns
# artist_info_ranking <- artist_info_ranking %>%
#   select(ranking, artist_rs, artist_spotify, id, everything())
# 
# artist_info_ranking <- artist_info_ranking[order(artist_info_ranking$ranking), ]
# 
# #Correcting the name on the artist_rs column for Parliament and Funkadelic
# artist_info_ranking <- artist_info_ranking %>%
#   mutate(artist_rs = ifelse(artist_rs == "Parliament and Funkadelic",
#                             paste0("Parliament and Funkadelic (", artist_spotify, ")"),
#                             artist_rs))
# 
# # Save the merged dataframe as an RDS file
# #saveRDS(artist_info_ranking, file = "./data/artist_info_ranking.rds")
# ############### Getting Top Tracks for Every Artist in the RS Ranking
# 
# #Retrieving the necessary RDS File
# artist_info_ranking <- readRDS("./data/artist_info_ranking.rds")
# 
# #Getting the top tracks of each artist in the us
# #Setting the access_token
# access_token <- get_spotify_access_token(client_id, client_secret)
# 
# # Creating an empty dataframe to store top tracks
# top_tracks_us <- data.frame()
# 
# # Loop through each artist in the ranking dataframe
# for (i in 1:nrow(artist_info_ranking)){
#    # Get the Spotify ID of the artist
#   id <- artist_info_ranking$id[i]
#   
#   # Get the top tracks of the artist in the US
#   result <- get_artist_top_tracks(
#     id,
#     market = "US",
#     authorization = access_token,
#     include_meta_info = FALSE
#   )
#   # Print the artist's name for visibility
#   print(artist_info_ranking$artist_rs[i])
# 
#   # Assume that the main artist of the track is the one we are looking for
#   result$artists <- id
#   
#   # Include the name of the artist for better readability
#   result$artist_spotify <- artist_info_ranking$artist_spotify[i]
#   
#   # Append the new rows to the dataframe
#   new_rows <- result
#   top_tracks_us <- rbind(top_tracks_us, new_rows)
# }
# 
# # Changing column names for consistency
# for (name in c("id", "name", "popularity", "preview_url")){
#   colnames(top_tracks_us)[colnames(top_tracks_us) == name] <- paste0("track_",name)
# }
# 
# # Renaming the "artists" column to "artist_id"
# colnames(top_tracks_us)[colnames(top_tracks_us) == "artists"] <- "artist_id"
# 
# # Reordering columns for better readability
# top_tracks_us <- top_tracks_us %>%
#   select(artist_spotify, artist_id, track_name, track_popularity, track_id, everything())
# 
# # Saving the dataframe as an RDS file
# #saveRDS(top_tracks_us, file = "./data/top_tracks_us.rds")

# ############### Getting audio features for all the top tracks US
# 
# 
# # Setting the access_token
# access_token <- get_spotify_access_token(client_id, client_secret)
# 
# # Creating an empty dataframe to store audio features
# audio_features_us <- data.frame()
# 
# # Initializing vectors to store track information
# track_list <- c(NULL)
# artist_id <- c(NULL)
# artist_spotify <- c(NULL)
# 
# # Loop through each track in the top tracks US dataframe
# for (i in 1:nrow(top_tracks_us)){
#   # Get the track ID
#   track <- top_tracks_us$track_id[i]
#   
#   # Append track information to vectors
#   track_list <- c(track_list, track)
#   artist_id <- c(artist_id, top_tracks_us$artist_id[i])
#   artist_spotify <- c(artist_spotify, top_tracks_us$artist_spotify[i])
#   
#   # If 100 tracks are accumulated or it's the last iteration, get audio features
#   if (length(track_list) == 100 || i == nrow(top_tracks_us)){
#     result <- get_track_audio_features(track_list,
#                                        authorization = access_token)
#     
#     # Combine audio features with artist information
#     new_rows <- cbind(result, artist_id)
#     new_rows <- cbind(new_rows, artist_spotify)
#     
#     # Append the new rows to the dataframe
#     audio_features_us <- rbind(audio_features_us, new_rows)
#     
#     # Restarting the track list and related vectors
#     track_list <- c(NULL)
#     artist_id <- c(NULL)
#     artist_spotify <- c(NULL)
#   }
# }
# 
# # Renaming the "id" column to "track_id"
# colnames(audio_features_us)[colnames(audio_features_us) == "id"] <- "track_id"
# 
# # Saving the dataframe as an RDS file
# #saveRDS(audio_features_us, file = "./data/audio_features_us.rds")
# ############### Scraping the music Brainz website
# ############### to get information on the end of each artist
# # 
# 
# # Reading the artist information from a saved RDS file
# artist_info_ranking <- readRDS("./data/artist_info_ranking.rds")
# 
# # Setting up a WebDriver for scraping MusicBrainz website
# rD <- rsDriver(browser=c("firefox"), verbose = F, chromever = NULL)
# driver <- rD[["client"]] 
# url <- "https://musicbrainz.org/"
# 
# # Navigating to the MusicBrainz website
# driver$navigate(url)
# 
# # Creating empty dataframes to store MusicBrainz data
# mbdata <- data.frame()
# all_results <- data.frame()
# 
# # Looping through each artist in the Rolling Stones ranking
# for (artist in artist_info_ranking$artist_spotify){
#   # Find the search field and perform the search
#       
#   # Performing a search for each artist on MusicBrainz
#   while (TRUE){
#     # printing the artist that is being searched
#     print(artist)
#     
#     # Finding the search field and performing the search
#     search_field <- driver$findElement(using = 'css', value = '#headerid-query')
#     search_field$sendKeysToElement(list(artist))
#     search_field$sendKeysToElement(list(key = "enter"))
#         
#     # Giving time for the search results to show
#     Sys.sleep(2)
#     
#     # Getting the search results
#     search_results <- try(driver$findElement(using = 'css', value = ".tbl"), silent = TRUE)
#     
#     # If there is an internal error, try the search again
#     if (inherits(search_results, "try-error")){
#       print("Trying search again")
#       Sys.sleep(2)
#     } else {break}
#   }
#   
#   # Extracting the outer HTML of the table
#   results_html <- search_results$getElementAttribute("outerHTML")
#   
#   #Reading the HTML element of the results
#   results_html <- read_html(results_html[[1]])
#   
#   #Extracting the table of the results
#   results_table <- html_table(results_html)[[1]]
#   
#   #selecting the first row as the correct result
#   new_row <- results_table[1,]
#   new_row <- cbind(artist_spotify = artist, new_row)
#   mbdata <- rbind(mbdata, new_row)
#   
#   #Saving all other results in case there is mismatch
#   result_rows <- cbind(artist_spotify = artist, results_table)
#   all_results <- rbind(all_results, result_rows)
# }
# 
# # Closing the port 
# driver$close()
# rD$server$stop()
# 
# # close the associated Java processes if necessary:
# system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
# 
# # Correcting the result for The Drifters that found a Taiwanese group
# # (the searched artist name is stored in the artist_spotify column)
# the_drifters <- all_results[(all_results$artist_spotify == "The Drifters"),]
# #The correct result is the second row
# the_drifters <- the_drifters[2,]
# 
# # replacing the row in the mbdata dataframe
# # Finding the index of the original data
# drifters_index <- which(mbdata$artist_spotify == "The Drifters")
# # Update the corresponding row in mbdata with the corrected information
# mbdata[drifters_index, ] <- the_drifters
# 
# 
# # Treating the Begin and End dates to make them in a uniform format
# # If there is only the year and month, the day 01 is assigned
# # If there is only the year, the first of july of the year is assigned
# 
# mbdata$Begin <- ifelse(nchar(mbdata$Begin) == 4, paste0(mbdata$Begin, "-07-01"), mbdata$Begin)
# mbdata$Begin <- ifelse(nchar(mbdata$Begin) == 7, paste0(mbdata$Begin, "-01"), mbdata$Begin)
# 
# mbdata$End <- ifelse(is.na(mbdata$End), "", mbdata$End)
# mbdata$End <- ifelse(nchar(mbdata$End) == 4, paste0(mbdata$End, "-07-01"), mbdata$End)
# mbdata$End <- ifelse(nchar(mbdata$End) == 7, paste0(mbdata$End, "-01"), mbdata$End)
# 
# #Saving the mbdata as a RDS file to be loaded later
# #saveRDS(mbdata, "./data/mbdata.rds")
# ############### Retrieving information of the Most Popular Artists today
# ############### Retrieving their top tracks and audio features on Spotify
# 
# 
# # Setting the URL for the webpage with Spotify's most popular artists
# url <- "https://chartmasters.org/spotify-most-popular-artists/"
# 
# # Reading HTML content from the webpage
# html_content <- read_html(url)
# 
# # Extracting tabular data from the HTML content
# tab <- html_table(html_content, fill = TRUE)
# popular_artists <- tab[[1]]
# 
# # Renaming columns for better readability
# colnames(popular_artists) <- c("rank",
#                                "artist_img",
#                                "artist_cm",
#                                "popularity",
#                                "weekly +/-",
#                                "daily Streams")
# 
# # Removing unnecessary columns
# popular_artists <- popular_artists %>% select(-artist_img, -"weekly +/-", - popularity)
# 
# ### Getting the Spotify id for these artists
# 
# # Getting the info to access the API
# readRenviron("api.env")
# client_id <- Sys.getenv("SPOTIFY_CLIENT_ID")
# client_secret <- Sys.getenv("SPOTIFY_CLIENT_SECRET")
# access_token <- get_spotify_access_token(client_id, client_secret)
# 
# # Creating dataframe to store information
# spotify_id <- data.frame()
# # Looping through each artist in the popular_artists dataframe
# for (artist in popular_artists$artist_cm){
#   q <- artist
#   print(q)
#   
#   result <- search_spotify(
#     q,
#     type = "artist",
#     market = NULL,
#     limit = 1,
#     offset = 0,
#     include_external = NULL,
#     authorization = access_token,
#     include_meta_info = FALSE
#   )
#   
#   # Creating a new row with the obtained result
#   new_row <- data.frame(result)
#   new_row <- cbind(artist_cm = q, new_row)
#   spotify_id <- rbind(spotify_id, new_row)
# }
# 
# # Selecting the relevant information
# spotify_id <- spotify_id %>% select(artist_cm, artist_spotify = name, artist_id = id, genres, popularity)
# 
# # Checking for different names
# indices <- which(spotify_id$artist_spotify != spotify_id$artist_cm)
# 
# if (length(indices) > 0) {
#   print(paste("Different name on indices:", indices))
#   print(paste("1 -", spotify_id$artist_spotify[indices]))
#   print(paste("2 -", spotify_id$artist_cm[indices]))
# }
# 
# # The differences don't seem relevant, no other measures needed
# 
# # Merging the two datasets (most popular artists and spotify IDs)
# spotify_id_subset <- spotify_id %>% select(artist_cm, artist_spotify, artist_id, artist_popularity = popularity)
# 
# popular_artists <- merge(popular_artists, spotify_id_subset, by = "artist_cm")
# 
# #Rearranging the columns
# popular_artists <- popular_artists %>% select(rank,
#                                                 artist_cm,
#                                                 artist_spotify,
#                                                 everything()) %>% arrange(rank)
# 
# # Saving the merged dataframe as an RDS file
# #saveRDS(popular_artists, "./data/popular_artists_cm.rds")
# 
# #loading the information
# popular_artists <- readRDS("./data/popular_artists_cm.rds")
# 
# #Getting the top tracks of each artist in the us
# #Setting the access_token
# access_token <- get_spotify_access_token(client_id, client_secret)
# 
# pop_top_tracks_us <- data.frame()
# # Looping through each artist in the popular_artists dataframe
# for (i in 1:nrow(popular_artists)){
#   id <- popular_artists$artist_id[i]
#   
#   # Getting the top tracks of the artist in the US
#   result <- get_artist_top_tracks(
#     id,
#     market = "US",
#     authorization = access_token,
#     include_meta_info = FALSE
#   )
#   print(popular_artists$artist_spotify[i])
#   
#   # Assuming that the main artist of the track is the one that we are looking for
#   result$artists <- id
#   
#   # Including the name of the artist for better readability
#   result$artist_spotify <- popular_artists$artist_spotify[i]
#   
#   # Appending the new rows to the dataframe
#   new_rows <- result
#   pop_top_tracks_us <- rbind(pop_top_tracks_us, new_rows)
# }
# 
# #selecting the relevant columns
# pop_top_tracks_us <- pop_top_tracks_us %>% select(artist_spotify,
#                                                   artist_id = artists,
#                                                   track_id = id,
#                                                   track_name = name,
#                                                   track_popularity = popularity,
#                                                   album.release_date,
#                                                   album.release_date_precision)
# 
# # Getting audio features for all the top tracks in the US
# pop_audio_features_us <- data.frame()
# track_list <- c(NULL)
# artist_id_list <- c(NULL)
# artist_list <- c(NULL)
# 
# # Looping through each track in the pop_top_tracks_us dataframe
# for (i in 1:nrow(pop_top_tracks_us)){
#   
#   track <- pop_top_tracks_us$track_id[i]
#   track_list <- c(track_list, track)
#   artist_id_list <- c(artist_id_list, pop_top_tracks_us$artist_id[i])
#   artist_list <- c(artist_list, pop_top_tracks_us$artist_spotify[i])
#   
#   # If 100 tracks are accumulated or it's the last iteration, get audio features
#   if (length(track_list) == 100 || i == nrow(pop_top_tracks_us)){
#     result <- get_track_audio_features(track_list,
#                                        authorization = access_token)
#     
#     # Combining audio features with artist information
#     new_rows <- cbind(result, artist_id = artist_id_list)
#     new_rows <- cbind(new_rows, artist_spotify = artist_list)
#     
#     # Appending the new rows to the dataframe
#     pop_audio_features_us <- rbind(pop_audio_features_us, new_rows)
#     
#     # Restarting the track list and related vectors
#     track_list <- c(NULL)
#     artist_id_list <- c(NULL)
#     artist_list <- c(NULL)
#   }
# }
# # Renaming the "id" column to "track_id"
# colnames(pop_audio_features_us)[colnames(pop_audio_features_us) == "id"] <- "track_id"
# pop_audio_features_us_subset <- pop_audio_features_us %>% select (-artist_spotify)
# 
# # Merging the top tracks and the audio features
# pop_audio_features_us <- merge(pop_top_tracks_us,
#                                pop_audio_features_us_subset,
#                                by.x = c("track_id", "artist_id"),
#                                by.y = c("track_id", "artist_id"))
# 
# 
# # Saving the merged dataframe as an RDS file
# #saveRDS(pop_audio_features_us, "./data/popular_artists_audio_features.rds")
# # All the data has been retrieved and processed
# # Whenever needed the data will be directly loaded
# # Cleaning the environment for better performance
# 
# rm(list = ls())

#loading the needed datasets
artist_info_ranking <- readRDS("./data/artist_info_ranking.rds")

# Define artist popularity based on a threshold of 50
artist_info_ranking$is_popular <- ifelse(artist_info_ranking$popularity > 50,
                                         "Popular",
                                         "Not Popular")

# Create a new dataset to use in the plot
# with additional columns and arrange the data
plot_one_data <- artist_info_ranking %>%
  # New Column with followers in thousands and round to 2 decimal places
  mutate(followers_k = round(followers.total/1000, 2)) %>%
  
  # Convert artist_spotify to a factor, keeping the current row order as the level order
  mutate(artist_spotify = factor(artist_spotify, levels = unique(artist_spotify))) %>%
  
  # Prepare text for tooltip in interactive plot
  mutate(text = paste("Artist: ", artist_spotify,
                      "\nPopularity on Spotify: ", popularity,
                      "\nRanking Rolling Stones: ", ranking,
                      "\nFollowers (1,000): ", followers_k, sep=""))

# Create a ggplot scatter plot with a smooth line and additional formatting
p1 <- plot_one_data %>%
  ggplot() +
  geom_point(aes(x = ranking, y = popularity, color = is_popular, text = text), size = 3, alpha = 0.5) +
  geom_line(stat = "smooth", method = lm, aes(x = ranking, y = popularity), formula = 'y ~ x', color = "red", alpha = 0.5) +
  theme_minimal() +
  labs(title = "Rolling Stones Ranking x Spotify Popularity",
       color = NULL,
       x = "Rolling Stones Ranking",
       y = "Spotify Popularity") +
  ylim(0, 100) +
  xlim(0, 100) +
  geom_vline(xintercept = 0, color = "black", alpha = 0.1) +  # Add vertical line at x = 0
  geom_hline(yintercept = 0, color = "black", alpha = 0.1) +  # Add horizontal line at y = 0
  theme(legend.position = "bottom",  # Place the legend below the plot
        panel.grid = element_line(color = alpha("gray", 0.2), linetype = "dashed"),  # Light dashed grid lines
        axis.text = element_text(size = 8),  # Adjust axis text size
        axis.title = element_text(size = 10),  # Adjust axis title size
        plot.title = element_text(hjust = 0.5))  # Center the plot title

# Convert ggplot plot to plotly for interactivity with tooltips
p1_int <- ggplotly(p1, tooltip = "text")
/* Centering the Plotly interactive plot */

/* Source: https://stackoverflow.com/questions/47193192/r-markdown-and-plotly-fig-align-not-working-with-html-output */

.center {
  display: table;
  margin-right: auto;
  margin-left: auto;
}
#Plotting the graph
p1_int
# Print a message
cat("Statistical Summary of Spotify Popularity\n\n")

# Print the summary of the 'popularity' variable
print(summary(artist_info_ranking$popularity))
# Read artist ranking information and top tracks data from RDS files
artist_info_ranking <- readRDS("./data/artist_info_ranking.rds")
top_tracks_us <- readRDS("./data/top_tracks_us.rds")

# Subset the necessary columns from artist_info_ranking and top_tracks_us
subset_artist <- artist_info_ranking %>% select(id, artist_spotify, popularity, genres)

subset_top_tracks <- top_tracks_us %>% select(artist_id, track_popularity, track_id, track_name, album.release_date)

# Merge the two datasets based on artist IDs
merged_data_plot2 <- merge(subset_artist, subset_top_tracks, by.x = "id", by.y = "artist_id")

# Create mean track popularity data for plotting
mean_data <- merged_data_plot2 %>%
  group_by(artist_spotify) %>%
  summarise(mean_track_popularity = mean(track_popularity))

# Merge mean data with the original data
merged_data_plot2 <- merge(merged_data_plot2, mean_data, by = "artist_spotify")

# Factoring the artist_spotify for plotting in order of artist popularity
popularity_order <- artist_info_ranking %>% select(artist_spotify, popularity) %>% arrange(desc(popularity))

merged_data_plot2$artist_spotify <- factor(merged_data_plot2$artist_spotify, levels = popularity_order$artist_spotify)
# Create a ggplot with three sets of points representing different popularity metrics
p2 <- ggplot(merged_data_plot2, aes(y = artist_spotify)) +
  geom_point(aes(x = track_popularity, color = "Track Popularity"), size = 1.5, alpha = 0.7) +
  geom_point(aes(x = popularity, color = "Artist Popularity"), size = 1.5, alpha = 0.7) +
  geom_point(aes(x = mean_track_popularity, color = "Average Track Popularity"), size = 1.5, alpha = 0.7) +
  theme_minimal() +
  labs(title = "Artist and Top Tracks (US) Popularity",
       y = NULL,
       x = "Popularity",
       subtitle = "Artists are in Ascending Popularity Order") +
  scale_color_manual(
    values = c("#E63946", "#FBAF4F", "#BFD3C1"),
    name = NULL
  ) +
  theme(axis.text.x = element_text(hjust = 1, vjust = 1),
        panel.grid = element_line(color = alpha("gray", 0.2), linetype = "solid"),
        panel.grid.major.x = element_line(linetype = "dashed"),
        panel.grid.minor.x = element_line(linetype = "dashed"),
        axis.text.y = element_text(hjust = 1, vjust = 0.5),
        plot.title = element_text(hjust = 0.5, size = 16),
        plot.subtitle = element_text(hjust = 0.5, size = 11),
        legend.position = "top") +
  geom_vline(xintercept = 0, color = "black", alpha = 0.2) +  # Add vertical line at x = 0
  scale_x_continuous(breaks = c(0, 25, 50, 75, 100),
                     expand = c(0, 0),
                     limits = c(-1, 101),
                     sec.axis = sec_axis(~., name = "Popularity"))
  
# Display the plot
p2
# Flag tracks whose popularity is a positive outlier within each artist (1.5 * IQR rule)
merged_data_outlier <- merged_data_plot2 %>%
  group_by(artist_spotify) %>%
  mutate(
    Q1 = quantile(track_popularity, 0.25),
    Q3 = quantile(track_popularity, 0.75),
    IQR_value = Q3 - Q1,
    upper_bound = Q3 + 1.5 * IQR_value,
    outliers = ifelse(track_popularity > upper_bound, "Positive Outlier", "Not Outlier"),
    mean_popularity = mean(track_popularity),
    diff_to_mean = track_popularity - mean_popularity
  ) %>%
  ungroup()

outlier_summary <- merged_data_outlier %>%
  filter(outliers == "Positive Outlier") %>%
  summarise("Total Positive Outliers" = n(),
            "Unique Artists With Positive Outliers" = n_distinct(artist_spotify),
            "Mean Artist Popularity" = mean(popularity),
            "Median Artist Popularity" = median(popularity),
            "Mean Difference to Average Track Popularity" = mean(diff_to_mean),
            "Max Difference to Average Track Popularity" = max(diff_to_mean))

# Convert the summary table to a nicely formatted table
kable(outlier_summary, "simple", align = 'c', caption = "Positive Outliers Summary")
artist_info_ranking <- readRDS("./data/artist_info_ranking.rds")
top_tracks_us <- readRDS("./data/top_tracks_us.rds")
mbdata <- readRDS("./data/mbdata.rds")

#Selecting the necessary variables from each of the datasets
subset_mbdata <- mbdata %>% select(artist_spotify, Begin, End)
subset_artist_info_ranking <- artist_info_ranking %>% select(artist_spotify, artist_id = id, popularity)
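# artist_spotify already comes from the other two subsets; leaving it out here
# avoids duplicated artist_spotify.x / artist_spotify.y columns after the merge by artist_id
subset_top_tracks_us <- top_tracks_us %>% select(artist_id, track_name, track_id, track_popularity, album.release_date)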

merge_data_p3 <- merge(subset_mbdata, subset_artist_info_ranking, by = "artist_spotify")

merge_data_p3 <- merge(merge_data_p3, subset_top_tracks_us, by = "artist_id")

# Convert dates to Date objects
merge_data_p3$Begin <- as.Date(merge_data_p3$Begin)
merge_data_p3$End <- as.Date(merge_data_p3$End)
merge_data_p3$album.release_date <- as.Date(merge_data_p3$album.release_date)

merge_data_p3$band_ended <- ifelse(merge_data_p3$End < merge_data_p3$album.release_date, "Release of Non Active Artist", "Regular Release")

merge_data_p3$band_ended[is.na(merge_data_p3$band_ended)] <- "Regular Release"

merge_data_p3$is_pop <- ifelse(merge_data_p3$popularity > 50, "Popular Artist", "Unpopular Artist")

average_popularity <- merge_data_p3 %>%
  group_by(band_ended) %>%
  summarise(avg_popularity = mean(popularity))

plot_data <- merge_data_p3 %>% filter(!is.na(album.release_date))
# Track popularity by album release date for 'popular' artists, colored by release status (band_ended)
p2 <- ggplot(plot_data %>% filter(is_pop == "Popular Artist"), aes(x = album.release_date, y = track_popularity, color = band_ended)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_x_date(date_breaks = "5 years", date_labels = "%Y", limits = c(as.Date("1958-01-01"), as.Date("2023-12-12") )) +
  labs(title = "Track Popularity Over Album Release Dates",
       subtitle = "'Popular' Artists",
       x = "Album Release Date",
       y = "Track Popularity",
       color = "") +
  theme_minimal() +
  ylim(0,100) +
  geom_smooth(method = "lm", se = FALSE) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_line(linetype = "dashed"),
        panel.grid.major.y = element_line(linetype = "dashed"),
        plot.title = element_text(size = 16, hjust = 0.5, color = "black", face = "bold"),
        plot.subtitle = element_text(size = 12, hjust = 0.5, color = "black")) +
  geom_hline(yintercept = 0, linetype = "solid", color = "black", alpha = 0.5)  # Single reference line at y = 0

p3 <- ggplot(plot_data %>% filter(is_pop != "Popular Artist"), aes(x = album.release_date, y = track_popularity, color = band_ended)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_x_date(date_breaks = "5 years", date_labels = "%Y", limits = c(as.Date("1958-01-01"), as.Date("2023-12-12") )) +
  labs(title = "Track Popularity Over Album Release Dates",
       subtitle = "'Unpopular' Artists",
       x = "Album Release Date",
       y = "Track Popularity",
       color = "") +
  theme_minimal() +
  ylim(0,100) +
  #xlim(as.Date("1960-01-01"), NA) +
  geom_smooth(method = "lm", se = FALSE)+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_line(linetype = "dashed"),
        panel.grid.major.y = element_line(linetype = "dashed"),
        plot.title = element_text(size = 16, hjust = 0.5, color = "black", face = "bold"),
        plot.subtitle = element_text(size = 12, hjust = 0.5, color = "black")) +
  geom_hline(yintercept = 0, linetype = "solid", color = "black", alpha = 0.5)  # Single reference line at y = 0

# Display the plots side by side
library(gridExtra)
grid.arrange(p2, p3, ncol = 2)

# Load the datasets used for the audio features comparison
artist_info_ranking <- readRDS("./data/artist_info_ranking.rds")
top_tracks_us <- readRDS("./data/top_tracks_us.rds")
audio_features_us <- readRDS("./data/audio_features_us.rds")

# First, merge artist_info_ranking with audio_features_us to classify artists as popular or unpopular
artist_info_ranking_subset <- artist_info_ranking %>% select(artist_id = id, popularity)

audio_features_us_merged <- merge(audio_features_us, artist_info_ranking_subset, by = "artist_id") 

#Creating the column is_pop in the audio_features_us_merged
audio_features_us_merged$is_pop <- ifelse(audio_features_us_merged$popularity > 50, "Popular", "Unpopular")

#Importing the data from the popular artists of 2023 audio features
popular_art_aud_feat <- readRDS("./data/popular_artists_audio_features.rds")
#Creating a column to indicate that these are the most popular artists
popular_art_aud_feat$is_pop <- "Most Popular Today"

# Selecting the columns that will be analyzed
subset_cols <- c("artist_spotify",
                 "track_id",
                 "artist_id",
                 "is_pop",
                 "danceability",
                 "energy",
                 "loudness",
                 "mode",
                 "speechiness",
                 "acousticness",
                 "instrumentalness",
                 "liveness",
                 "valence",
                 "tempo")

#Creating the subsets that will be merged to be plotted
audio_features_us_subset <- audio_features_us_merged %>% select(all_of(subset_cols))
popular_art_aud_feat_subset <- popular_art_aud_feat %>% select(all_of(subset_cols))

#The dfs have the same columns, so we will just rbind them
audio_feature_plot_data <- rbind(audio_features_us_subset, popular_art_aud_feat_subset)

# Select relevant columns for comparison
comparison_cols <- c("danceability",
                     "energy",
                     "loudness",
                     "speechiness",
                     "acousticness",
                     "liveness",
                     "valence",
                     "tempo")
custom_colors <- c("#1F78B4", "#FF7F00", "#33A02C")

# Plotting the density of each audio feature by artist popularity group
plots <- lapply(comparison_cols, function(col) {
  # .data[[col]] looks the column up by its name stored in `col`
  ggplot(audio_feature_plot_data, aes(x = .data[[col]], fill = is_pop)) +
    geom_density(alpha = 0.3) +
    scale_fill_manual(values = custom_colors) +
    labs(title = paste("Density Plot of", tools::toTitleCase(col)),
         x = col,
         fill = "Artist Popularity") +
    theme_minimal() +
    theme(legend.position = "bottom",
          plot.title = element_text(hjust = 0.5))
})
# Arrange plots in a grid with 2 columns
grid_plots <- wrap_plots(plots, ncol = 2)

# Display the grid of plots
print(grid_plots)
# This chunk generates the complete code appendix.
# eval=FALSE tells R not to run ("evaluate") the code here (it was already run before).